Are Linear Models Effective Predictors of Predator-Prey Interactions in Marine Food Webs?


Introduction

Marine fisheries are a healthy and abundant source of protein for human consumption1, as well as a rich source of fertilizer2 and fodder3 for terrestrial food production. As populations increase in the developing world4, it will become essential to maintain high quality fisheries to sustain the ever increasing caloric demands of the world’s population5. Therefore, it is important to understand as much as we can about how best to maintain and contribute to the health of populations of commercial fishery species.

Marine food webs are incredibly complex. Understanding how they work will improve our understanding of how best to maintain fish stocks. For example, in order to maintain stocks of large highly-prized species like Atlantic Cod and Bluefin Tuna, do we also need to worry about over-fishing their prey? We will ask if linear models are good predictors of the size of prey for a particular size of predator. If so, we can conclude not only how much, but of what sizes of prey species are needed to be maintained in order to keep stocks of larger predator species healthy.

Because marine species tend to have a wide array of novel morphological and behavioral differences, in comparing species, we should be careful to compare species with similar behaviors and ecological roles. In this study we will look at four species of predators with two distinct types of ecology.

First, we will look at two Benthopelagic species. Benthopelagic sharks, like squaloid sharks such as the Atlantic Spiny Dogfish, Squalus acanthias, achieve neutral buoyancy with the use of large oil-filled livers6. This allows them to adapt well to fairly high pressures and they can often be found on slopes down to about 2000 meters, scavenging on food falls such as dead whales. However, the energy demands of sharks are high since they need to swim constantly and maintain a large amount of oil for buoyancy. These energy needs cannot be met in the extreme oligotrophic conditions that occur at great depths, therefore they tend to stay in shallower locations closer to shore7.

Atlantic Spiny Dogfish, Squalus acanthias

Atlantic Spiny Dogfish, Squalus acanthias


Likewise, demersal fish such as the Atlantic Cod, Gadus morhua, which live and feed on or near the bottom of seas occupy sea floors which usually consist of mud, sand, gravel or rocks are usually found between 150 and 200 meters. They are omnivorous and feed on invertebrates and fish, including young cod8. (For this reason, we will only examine Adults in this study.)

They are both opportunistic feeders that forage on the bottom as well as in mid-water and near the surface, either alone or in hunting shoals9.

Atlantic Cod, Gadus morhua

Atlantic Cod, Gadus morhua


We will also look at two pelagic species which live in the North Atlantic Ocean pelagic zone, being neither close to the bottom nor near the shore – in contrast with demersal fish, which do live on or near the bottom, and reef fish, which are associated with coral reefs.

The Bluefish (Pomatomus saltatrix) is the only extant species of the family Pomatomidae. It is a marine pelagic fish found in temperate and subtropical waters. Bluefish commonly range in size from 18cm (7in) to much larger, sometimes weighing as much as 18kg (40lb), though fish heavier than 9kg are exceptional10. Adult bluefish are strong and aggressive, and live in loose groups. They are fast swimmers which prey on schools of forage fish, and continue attacking them in feeding frenzies even after they appear to have eaten their fill.11

Bluefish, Pomatomus saltatrix

Bluefish, Pomatomus saltatrix


Atlantic bluefin are native to both the western and eastern Atlantic Ocean, as well as the Mediterranean Sea. The Atlantic bluefin tuna is a close relative of the other two bluefin tuna species—the Pacific bluefin tuna and the southern bluefin tuna. Atlantic bluefin tuna may exceed 450kg (990lb) in weight. The head contains a “pineal window” that allows the fish to navigate over its multiple thousands of mile range. The Atlantic bluefin tuna typically hunts small fish such as sardines, herring and mackerel and invertebrates like squid and crustaceans.12

Western Atlantic Bluefin Tuna, Thunnus thynnus

Western Atlantic Bluefin Tuna, Thunnus thynnus


Though distinct in their morphology and only very distantly related as members of the order Perciformes (which includes over 10,000 species), both the Bluefish and Atlantic Bluefin Tuna fill similar ecological roles in similar regions of the Northern Atlantic Ocean with similar foraging behaviors.

By comparing two predator species each of the benthopelagic and pelagic types, we can not only draw conclusions about a particular species, but also determine if what we conclude is common to other species which inhabit the same ecological niche, while also being able to ask if there are commonalities shared between both benthopelagic and pelagic fish.

For our analysis, we used the Predator_and_prey_body_sizes_in_marine_food_webs_vsn4 data set from Ecological Archives E089-051-D1. for our analysis. The data set consists of 34931 records from 27 locations covering a wide range of environmental conditions from the tropics to the poles and for 93 types of predator with sizes ranging from 0.1 mg to over 415 kg and 174 prey types with sizes from 75 pg to over 4.5 kg. Each record includes: predator and prey scientific names, common names, taxa, life stages and sizes (length and mass with conversion details), plus the type of feeding interaction, geographic location (with habitat description, latitude, longitude) and mean annual environmental data (sea surface temperature and primary productivity).

Methods

Import the entire data set

Create the data frames

# Select only Adults in dataset
preyData <- droplevels(preyData[preyData$Predator.lifestage == "ADULT",])

# Select species for study per Introduction
n.atlantic.species <- preyData[(preyData$Predator %in% c("Squalus acanthias",
                                                          "Gadus morhua",
                                                          "Pomatomus saltatrix",
                                                          "Thunnus thynnus")),]

# Add Type factor
n.atlantic.species$Type <- ifelse(n.atlantic.species$Predator == "Squalus acanthias", "BENTHOPELAGIC"
                                  , ifelse(n.atlantic.species$Predator == "Gadus morhua", "BENTHOPELAGIC"
                                           , "PELAGIC"))
n.atlantic.species$Type <- as.factor(n.atlantic.species$Type)

# Tidy up all unused levels and columns
n.atlantic.species <- droplevels(n.atlantic.species)
n.atlantic.species <- subset(n.atlantic.species, select = c("Predator", "Prey", 
                                                           "Prey.taxon", "Feeding.interaction",
                                                           "Depth", "Diet.coverage", "Habitat",
                                                           "Latitude", "Longitude", "Location",
                                                           "Predator.length.SI", "Prey.length.SI",
                                                           "Predator.mass.SI","Prey.mass.SI", "Type"))

Results

As a starting point, we do some visual exploration of the data set; since we are focusing on predictors for Predator.length.SI, we chose the quantitative variables to look for interactions:

pairs(n.atlantic.species[, c(7, 11, 13, 12, 14)], col="darkorange")

While there seems to be a promising exponential relationship between length and mass, further research on the data set reveals that those variables are not obtained through observation, but instead estimated using agreed formulas. This poses a challenge to our original analysis, since it removes half of our quantitative factors for this analysis. We are left with just Prey.length.SI along with the categorical predictors in the data set.

Spiny Dogfish

Now we will look at the relationship between predator and prey length in the Spiny Dogfish:

species <- n.atlantic.species[n.atlantic.species$Predator == "Squalus acanthias",]

pairs(species[, c(7, 11, 13, 12, 14)], col="darkorange")

species.lm <- lm(Predator.length.SI ~ Prey.length.SI,
              data = species)

plot(Predator.length.SI ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm,
       col="navy")

kable(model.summary <- glance(species.lm))
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.0422 0.0419 0.1416 144.8 0 2 1762 -3519 -3500 65.86 3285

The results of the test are significant, with a large \(F\)-statistic of 144.7938. However, are the assumptions of the model met? We need to examine the distribution of predator length data for the Spiny Dogfish to make sure it is normally distributed:

hist(species$Predator.length.SI,
     main = "Distribution of lengths of Spiny Dogfish prey items",
     col  = "navy",
     freq = FALSE,
     xlab = "Length (m)")

curve(dnorm(x, mean = mean(species$Predator.length.SI),
      sd  = sd(species$Predator.length.SI)), 
      col = "darkorange",
      add = TRUE)

shapiro.test(species$Predator.length.SI)
## 
##  Shapiro-Wilk normality test
## 
## data:  species$Predator.length.SI
## W = 0.95, p-value <2e-16

Visually, we can infer that the data is not normally distributed about the mean and the results of the Shapiro-Wilk normality test with a very small p-value mean that we should reject the null hypothesis that the data is normally distributed and accept the alternative hypothesis that it is not. Now we will now examine the variance:

hist(resid(species.lm),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm)),
      sd  = sd(resid(species.lm))), 
      col = "darkorange",
      add = TRUE)

spreadLevelPlot(species.lm)

## 
## Suggested power transformation:  2.789
residplot(species.lm)

bptest(species.lm)
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm
## BP = 75, df = 1, p-value <2e-16

Looking at the fitted versus residuals plot it appears there is non-constant variance. Specifically, the variance decreases as the fitted value increases. We will need a transformation that accomplishes a variance stabilizing transformation. Using the Box-Cox method, we will justify a transformation other than log:

boxcox(species.lm, plotit = TRUE, lambda = seq(2.2, 2.6, by = 0.1))

species.lm.cox <- lm((((Predator.length.SI^2.4) - 1) / 2.4) ~ Prey.length.SI, data = species)

plot((((Predator.length.SI^2.4) - 1) / 2.4) ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm.cox,
       col="navy")

summary(species.lm.cox)
## 
## Call:
## lm(formula = (((Predator.length.SI^2.4) - 1)/2.4) ~ Prey.length.SI, 
##     data = species)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2563 -0.0693 -0.0041  0.0768  0.2732 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.17219    0.00309   -55.7   <2e-16 ***
## Prey.length.SI  0.25243    0.02207    11.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0989 on 3285 degrees of freedom
## Multiple R-squared:  0.0383, Adjusted R-squared:  0.038 
## F-statistic:  131 on 1 and 3285 DF,  p-value: <2e-16

The model is significant, but we need to check our assumptions. First, we will check for normality:

plot.normal_QQ(species.lm.cox)

The distribution does not appear to be normal. To verify, we will perform a Shapiro-Wilk normality test:

shapiro.test(resid(species.lm.cox))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(species.lm.cox)
## W = 0.99, p-value = 1e-12

Because of the very low p-value, we must reject the null hypothesis that the data is normally distributed and accept the alternative hypothesis that the data is not normally distributed. Now we will check the variance:

hist(resid(species.lm.cox),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm.cox)),
      sd  = sd(resid(species.lm.cox))), 
      col = "darkorange",
      add = TRUE)

residplot(species.lm.cox)

bptest(species.lm.cox )
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm.cox
## BP = 34, df = 1, p-value = 6e-09

From the results of the Breusch-Pagan test, we would reject the null hypothesis of homoscedasticity at \(\alpha = 0.01\), therefore the constant variance assumption is invalid.

However, since the normality assumption was violated, we will try to remove data points with high influence to see if the model is improved:

species.lm.cox.cd = cooks.distance(species.lm.cox)

species.lm.cox.fix = lm((((Predator.length.SI^2.4) - 1) / 2.4) ~ Prey.length.SI,
                     data = species,
                     subset = species.lm.cox.cd <= 4 / length(species.lm.cox.cd))

plot((((Predator.length.SI^2.4) - 1) / 2.4) ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm.cox,
       col="navy")

abline(species.lm.cox.fix,
       col="red")

plot.normal_QQ(species.lm.cox.fix)

shapiro.test(resid(species.lm.cox.fix))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(species.lm.cox.fix)
## W = 0.99, p-value = 3e-14

From the results of the Shapiro-Wilk normality test, we would reject the null hypothesis that the distribution of the errors follow a normal distribution at \(\alpha = 0.01\), therefore the normality assumption is invalid.

hist(resid(species.lm.cox.fix),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm.cox.fix)),
      sd  = sd(resid(species.lm.cox.fix))), 
      col = "darkorange",
      add = TRUE)

residplot(species.lm.cox.fix)

bptest(species.lm.cox.fix)
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm.cox.fix
## BP = 14, df = 1, p-value = 0.0002

From the results of the Breusch-Pagan test, we would reject the null hypothesis of homoscedasticity at \(\alpha = 0.01\), therefore the constant variance assumption is mathematically invalid, as visually there is clearly a tendency for the variance to increase with size.

In summary, we would reject this linear model for any purpose of explanation. From the residual plots, there is an indication of multimodality in the data, for which we don’t have the tools at our disposal to perform cluster analysis or any other method of separation or deconvolution. We therefore proceed to perform the same analysis for the next species.

Atlantic Cod

Now we will look at the relationship between predator and prey length in the Atlantic Cod:

species <- n.atlantic.species[n.atlantic.species$Predator == "Gadus morhua",]

pairs(species[, c(7, 11, 13, 12, 14)], col="darkorange")

species.lm <- lm(Predator.length.SI ~ Prey.length.SI,
              data = species)

plot(Predator.length.SI ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm,
       col="navy")

kable(model.summary <- glance(species.lm))
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.1924 0.1921 0.1838 589.9 0 2 681.8 -1358 -1340 83.68 2476

The results of the test are significant, with a large \(F\)-statistic of 589.9161. However, are the assumptions of the model met? We need to examine the distribution of predator length data for the Atlantic Cod to make sure it is normally distributed:

hist(species$Predator.length.SI,
     main = "Distribution of lengths of Atlantic Cod prey items",
     col  = "navy",
     freq = FALSE,
     xlab = "Length (m)")

curve(dnorm(x, mean = mean(species$Predator.length.SI),
      sd  = sd(species$Predator.length.SI)), 
      col = "darkorange",
      add = TRUE)

shapiro.test(species$Predator.length.SI)
## 
##  Shapiro-Wilk normality test
## 
## data:  species$Predator.length.SI
## W = 0.99, p-value = 6e-14

Visually, we can infer that the data is not normally distributed about the mean and the results of the Shapiro-Wilk normality test with a very small p-value mean that we should reject the null hypothesis that the data is normally distributed and accept the alternative hypothesis that it is not. Now we will now examine the variance:

hist(resid(species.lm),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm)),
      sd  = sd(resid(species.lm))), 
      col = "darkorange",
      add = TRUE)

spreadLevelPlot(species.lm)

## 
## Suggested power transformation:  0.9315
residplot(species.lm)

bptest(species.lm)
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm
## BP = 0.2, df = 1, p-value = 0.7

Looking at the fitted versus residuals plot it appears there is non-constant variance. Specifically, the variance increases as the fitted value increases. We will need a transformation that accomplishes a variance stabilizing transformation. Using the Box-Cox method, we will justify a transformation other than log:

boxcox(species.lm, plotit = TRUE, lambda = seq(0.4, 0.8, by = 0.1))

species.lm.cox <- lm((((Predator.length.SI^0.6) - 1) / 0.6) ~ Prey.length.SI, data = species)

plot((((Predator.length.SI^0.6) - 1) / 0.6) ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm.cox,
       col="navy")

summary(species.lm.cox)
## 
## Call:
## lm(formula = (((Predator.length.SI^0.6) - 1)/0.6) ~ Prey.length.SI, 
##     data = species)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6464 -0.1510  0.0028  0.1505  0.7317 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.49465    0.00699   -70.7   <2e-16 ***
## Prey.length.SI  1.12101    0.04716    23.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.216 on 2476 degrees of freedom
## Multiple R-squared:  0.186,  Adjusted R-squared:  0.185 
## F-statistic:  565 on 1 and 2476 DF,  p-value: <2e-16

The model is significant, but we need to check our assumptions. First, we will check for normality:

plot.normal_QQ(species.lm.cox)

The distribution appears to close be normal. To verify, we will perform a Shapiro-Wilk normality test:

shapiro.test(resid(species.lm.cox))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(species.lm.cox)
## W = 1, p-value = 0.02
shapiro.result <- shapiro.test(resid(species.lm.cox))$p.value

Because of the p-value of 0.0208, we do not reject the null hypothesis that the data is normally distributed with \(\alpha=0.01\) and reject the alternative hypothesis that the data is not normally distributed. Now we will check the variance:

hist(resid(species.lm.cox),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm.cox)),
      sd  = sd(resid(species.lm.cox))), 
      col = "darkorange",
      add = TRUE)

residplot(species.lm.cox)

bptest(species.lm.cox )
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm.cox
## BP = 6.7, df = 1, p-value = 0.01

From the results of the Breusch-Pagan test, we would fail to reject the null hypothesis of homoscedasticity at \(\alpha = 0.01\), therefore the constant variance assumption is valid.

We will see if we can improve the fit by removing points with high influence:

species.lm.cox.cd = cooks.distance(species.lm.cox)

species.lm.cox.fix = lm((((Predator.length.SI^0.6) - 1) / 0.6) ~ Prey.length.SI,
                     data = species,
                     subset = species.lm.cox.cd <= 4 / length(species.lm.cox.cd))

plot((((Predator.length.SI^0.6) - 1) / 0.6) ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm.cox,
       col="navy")

abline(species.lm.cox.fix,
       col="red")

plot.normal_QQ(species.lm.cox.fix)

shapiro.test(resid(species.lm.cox.fix))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(species.lm.cox.fix)
## W = 1, p-value = 3e-06

From the results of the Shapiro-Wilk normality test, we would reject the null hypothesis that the distribution of the errors follow a normal distribution at \(\alpha = 0.01\), therefore the normality assumption is invalid.

hist(resid(species.lm.cox.fix),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm.cox.fix)),
      sd  = sd(resid(species.lm.cox.fix))), 
      col = "darkorange",
      add = TRUE)

residplot(species.lm.cox.fix)

bptest(species.lm.cox.fix)
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm.cox.fix
## BP = 20, df = 1, p-value = 6e-06

From the results of the Breusch-Pagan test, we would reject the null hypothesis of homoscedasticity at \(\alpha = 0.01\), therefore the constant variance assumption is invalid.

Bluefish

Now we will look at the relationship between predator and prey length in Bluefish:

species <- n.atlantic.species[n.atlantic.species$Predator == "Pomatomus saltatrix",]

pairs(species[, c(7, 11, 13, 12, 14)], col="darkorange")

species.lm <- lm(Predator.length.SI ~ Prey.length.SI,
              data = species)

plot(Predator.length.SI ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm,
       col="navy")

kable(model.summary <- glance(species.lm))
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.2566 0.256 0.2031 443.9 0 2 226.4 -446.9 -431.4 53.06 1286

The results of the test are significant, with a large \(F\)-statistic of 443.9229. However, are the assumptions of the model met? We need to examine the distribution of predator length data for the Bluefish to make sure it is normally distributed:

hist(species$Predator.length.SI,
     main = "Distribution of lengths of Bluefish prey items",
     col  = "navy",
     freq = FALSE,
     xlab = "Length (m)")

curve(dnorm(x, mean = mean(species$Predator.length.SI),
      sd  = sd(species$Predator.length.SI)), 
      col = "darkorange",
      add = TRUE)

shapiro.test(species$Predator.length.SI)
## 
##  Shapiro-Wilk normality test
## 
## data:  species$Predator.length.SI
## W = 0.91, p-value <2e-16

Visually, we can infer that the data is not normally distributed about the mean and the results of the Shapiro-Wilk normality test with a very small p-value mean that we should reject the null hypothesis that the data is normally distributed and accept the alternative hypothesis that it is not. Again, like in the case of the dogfish, we are looking at a bad case of data multimodality. Now we will now examine the variance:

hist(resid(species.lm),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm)),
      sd  = sd(resid(species.lm))), 
      col = "darkorange",
      add = TRUE)

spreadLevelPlot(species.lm)

## 
## Suggested power transformation:  1.486
residplot(species.lm)

bptest(species.lm)
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm
## BP = 4.7, df = 1, p-value = 0.03

Looking at the fitted versus residuals plot it is clear that there is non-constant variance. Specifically, the variance increases as the fitted value increases. We will need a transformation that accomplishes a variance stabilizing transformation. Using the Box-Cox method, we will justify a transformation other than log:

boxcox(species.lm, plotit = TRUE, lambda = seq(0.2, 0.5, by = 0.1))

species.lm.cox <- lm((((Predator.length.SI^0.36) - 1) / 0.36) ~ Prey.length.SI, data = species)

plot((((Predator.length.SI^0.36) - 1) / 0.36) ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm.cox,
       col="navy")

summary(species.lm.cox)
## 
## Call:
## lm(formula = (((Predator.length.SI^0.36) - 1)/0.36) ~ Prey.length.SI, 
##     data = species)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3756 -0.3304 -0.0682  0.3373  0.9077 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.2011     0.0179   -67.1   <2e-16 ***
## Prey.length.SI   3.9512     0.1869    21.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.404 on 1286 degrees of freedom
## Multiple R-squared:  0.258,  Adjusted R-squared:  0.257 
## F-statistic:  447 on 1 and 1286 DF,  p-value: <2e-16

The model is significant, but we need to check our assumptions. First, we will check for normality:

plot.normal_QQ(species.lm.cox)

The distribution does not appear to be normal. To verify, we will perform a Shapiro-Wilk normality test:

shapiro.test(resid(species.lm.cox))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(species.lm.cox)
## W = 0.96, p-value <2e-16

Because of the very low p-value, we must reject the null hypothesis that the data is normally distributed and accept the alternative hypothesis that the data is not normally distributed. Now we will check the variance:

hist(resid(species.lm.cox),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm.cox)),
      sd  = sd(resid(species.lm.cox))), 
      col = "darkorange",
      add = TRUE)

residplot(species.lm.cox)

bptest(species.lm.cox )
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm.cox
## BP = 22, df = 1, p-value = 3e-06

From the results of the very low p-value from the Breusch-Pagan test, we would reject the null hypothesis of homoscedasticity, therefore the constant variance assumption is violated. We will try to remove data points with high influence to see if the model is improved:

species.lm.cox.cd = cooks.distance(species.lm.cox)

species.lm.cox.fix = lm((((Predator.length.SI^0.36) - 1) / 0.36) ~ Prey.length.SI,
                     data = species,
                     subset = species.lm.cox.cd <= 4 / length(species.lm.cox.cd))

plot((((Predator.length.SI^0.36) - 1) / 0.36) ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm.cox,
       col="navy")

abline(species.lm.cox.fix,
       col="red")

plot.normal_QQ(species.lm.cox.fix)

shapiro.test(resid(species.lm.cox.fix))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(species.lm.cox.fix)
## W = 0.95, p-value <2e-16

From the results of the Shapiro-Wilk normality test, we would reject the null hypothesis that the distribution of the errors follow a normal distribution at \(\alpha = 0.01\), therefore the normality assumption is invalid.

hist(resid(species.lm.cox.fix),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm.cox.fix)),
      sd  = sd(resid(species.lm.cox.fix))), 
      col = "darkorange",
      add = TRUE)

spreadLevelPlot(species.lm.cox.fix)

## 
## Suggested power transformation:  0.8922
residplot(species.lm.cox.fix)

bptest(species.lm.cox.fix)
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm.cox.fix
## BP = 58, df = 1, p-value = 2e-14

From the results of the Breusch-Pagan test, we would reject the null hypothesis of homoscedasticity \(\alpha = 0.01\), therefore the constant variance assumption is violated.

Again, we would reject this linear model for any purpose of explanation. From the residual plots, there is an indication of multimodality in the data, for which we don’t have the tools at our disposal to perform cluster analysis or any other method of separation or deconvolution. We therefore proceed to perform the same analysis for the next species.

Western Atlantic Bluefin Tuna

Finally we will look at the relationship between predator and prey length in the Western Atlantic Bluefin Tuna:

species <- n.atlantic.species[n.atlantic.species$Predator == "Thunnus thynnus",]

pairs(species[, c(7, 11, 13, 12, 14)], col = "darkorange")

species.lm <- lm(Predator.length.SI ~ Prey.length.SI,
              data = species)

plot(Predator.length.SI ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm,
       col="navy")

kable(model.summary <- glance(species.lm))
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.1525 0.1521 0.4179 343.3 0 2 -1042 2090 2107 333.1 1907

The results are significant. However, are the assumptions of the model met? We need to examine the distribution of prey length data for the Atlantic Bluefin Tuna to make sure it is normally distributed:

hist(species$Predator.length.SI,
     main = "Distribution of lengths of Atlantic Bluefin Tuna prey items",
     col  = "navy",
     freq = FALSE,
     xlab = "Length (m)")

curve(dnorm(x, mean = mean(species$Predator.length.SI),
      sd  = sd(species$Predator.length.SI)), 
      col = "darkorange",
      add = TRUE)

shapiro.test(species$Predator.length.SI)
## 
##  Shapiro-Wilk normality test
## 
## data:  species$Predator.length.SI
## W = 0.97, p-value <2e-16

Visually, we can infer that the data is not normally distributed about the mean and the results of the Shapiro-Wilk normality test with a very small p-value mean that we should reject the null hypothesis that the data is normally distributed and accept the alternative hypothesis that it is not. Now we will now examine the variance:

hist(resid(species.lm),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm)),
      sd  = sd(resid(species.lm))), 
      col = "darkorange",
      add = TRUE)

spreadLevelPlot(species.lm)

## 
## Suggested power transformation:  2.877
residplot(species.lm)

bptest(species.lm)
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm
## BP = 11, df = 1, p-value = 0.0007

Looking at the fitted versus residuals plot it appears there is non-constant variance. Specifically, the variance increases as the fitted value increases. We will need a transformation that accomplishes a variance stabilizing transformation. Using the Box-Cox method, we will justify a transformation other than log:

boxcox(species.lm, plotit = TRUE, lambda = seq(1.1, 1.5, by = 0.1))

species.lm.cox <- lm((((Predator.length.SI^1.3) - 1) / 1.3) ~ Prey.length.SI, data = species)

plot((((Predator.length.SI^1.3) - 1) / 1.3) ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm.cox,
       col="navy")

summary(species.lm.cox)
## 
## Call:
## lm(formula = (((Predator.length.SI^1.3) - 1)/1.3) ~ Prey.length.SI, 
##     data = species)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1732 -0.3503 -0.0035  0.4325  1.3046 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.6703     0.0304    22.1   <2e-16 ***
## Prey.length.SI   2.9677     0.1603    18.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.512 on 1907 degrees of freedom
## Multiple R-squared:  0.152,  Adjusted R-squared:  0.152 
## F-statistic:  343 on 1 and 1907 DF,  p-value: <2e-16

The model is significant, but we need to check our assumptions. First, we will check for normality:

plot.normal_QQ(species.lm.cox)

The distribution does not appear to be normal. To verify, we will perform a Shapiro-Wilk normality test:

shapiro.test(resid(species.lm.cox))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(species.lm.cox)
## W = 0.98, p-value <2e-16

Because of the very low p-value, we must reject the null hypothesis that the data is normally distributed and accept the alternative hypothesis that the data is not normally distributed. Now we will check the variance:

hist(resid(species.lm.cox),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm.cox)),
      sd  = sd(resid(species.lm.cox))), 
      col = "darkorange",
      add = TRUE)

residplot(species.lm.cox)

bptest(species.lm.cox )
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm.cox
## BP = 8.1, df = 1, p-value = 0.004

From the results of the low p-value from the Breusch-Pagan test, we would reject the null hypothesis of homoscedasticity, therefore the constant variance assumption is violated. We will try to remove data points with high influence to see if the model is improved:

species.lm.cox.cd = cooks.distance(species.lm.cox)

species.lm.cox.fix = lm((((Predator.length.SI^2.4) - 1) / 2.4) ~ Prey.length.SI,
                     data = species,
                     subset = species.lm.cox.cd <= 4 / length(species.lm.cox.cd))

plot((((Predator.length.SI^2.4) - 1) / 2.4) ~ Prey.length.SI,
     main = "Predators vs. Prey",
     xlab = "Prey length (m)",
     ylab = "Predator length (m)",
     data = species,
     col  = "darkorange")

abline(species.lm.cox,
       col="navy")

abline(species.lm.cox.fix,
       col="red")

plot.normal_QQ(species.lm.cox.fix)

shapiro.test(resid(species.lm.cox.fix))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(species.lm.cox.fix)
## W = 0.95, p-value <2e-16

From the results of the Shapiro-Wilk normality test, we would reject the null hypothesis that the distribution of the errors follow a normal distribution at \(\alpha = 0.01\), therefore the normality assumption is invalid.

hist(resid(species.lm.cox.fix),
     xlab   = "Residuals",
     main   = "Histogram of Residuals",
     freq   = FALSE,
     col    = "navy")

curve(dnorm(x, mean = mean(resid(species.lm.cox.fix)),
      sd  = sd(resid(species.lm.cox.fix))), 
      col = "darkorange",
      add = TRUE)

spreadLevelPlot(species.lm.cox.fix)

## 
## Suggested power transformation:  0.8433
residplot(species.lm.cox.fix)

bptest(species.lm.cox.fix)
## 
##  studentized Breusch-Pagan test
## 
## data:  species.lm.cox.fix
## BP = 0.22, df = 1, p-value = 0.6

From the results of the Breusch-Pagan test, we would fail to reject the null hypothesis of homoscedasticity at \(\alpha = 0.01\), therefore the constant variance assumption is valid.

summary(species.lm.cox.fix)
## 
## Call:
## lm(formula = (((Predator.length.SI^2.4) - 1)/2.4) ~ Prey.length.SI, 
##     data = species, subset = species.lm.cox.cd <= 4/length(species.lm.cox.cd))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.293 -0.887 -0.171  0.828  2.900 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.4858     0.0763    6.36  2.5e-10 ***
## Prey.length.SI   9.0534     0.4186   21.63  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 1864 degrees of freedom
## Multiple R-squared:  0.201,  Adjusted R-squared:   0.2 
## F-statistic:  468 on 1 and 1864 DF,  p-value: <2e-16

The linear model is better than the rest in our set, but can it be improved? We will add factors and see:

species.lm.cox.fix.add <- lm((((Predator.length.SI^2.4) - 1) / 2.4) ~ Predator.length.SI + Depth + Location,
                     data = species,
                     subset = species.lm.cox.cd <= 4 / length(species.lm.cox.cd))

summary(species.lm.cox.fix.add)
## 
## Call:
## lm(formula = (((Predator.length.SI^2.4) - 1)/2.4) ~ Predator.length.SI + 
##     Depth + Location, data = species, subset = species.lm.cox.cd <= 
##     4/length(species.lm.cox.cd))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4396 -0.1252 -0.0434  0.1099  0.6118 
## 
## Coefficients: (1 not defined because of singularities)
##                                                          Estimate Std. Error
## (Intercept)                                             -3.375499   0.039458
## Predator.length.SI                                       2.690724   0.010164
## Depth                                                   -0.001199   0.000499
## LocationGreat Channel South, Gulf of Maine, New England  0.013113   0.025579
## LocationGulf of Maine, New England                       0.363969   0.071150
## LocationJeffreys Ledge, Gulf of Maine, New England      -0.020147   0.026959
## LocationSouth of Marthas Vineyard, New England           0.483140   0.028174
## LocationStellwagen Bank, Gulf of Maine, New England            NA         NA
##                                                         t value Pr(>|t|)    
## (Intercept)                                              -85.55  < 2e-16 ***
## Predator.length.SI                                       264.74  < 2e-16 ***
## Depth                                                     -2.40    0.016 *  
## LocationGreat Channel South, Gulf of Maine, New England    0.51    0.608    
## LocationGulf of Maine, New England                         5.12  3.5e-07 ***
## LocationJeffreys Ledge, Gulf of Maine, New England        -0.75    0.455    
## LocationSouth of Marthas Vineyard, New England            17.15  < 2e-16 ***
## LocationStellwagen Bank, Gulf of Maine, New England          NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.164 on 1859 degrees of freedom
## Multiple R-squared:  0.981,  Adjusted R-squared:  0.981 
## F-statistic: 1.62e+04 on 6 and 1859 DF,  p-value: <2e-16

The model is significant and has improved significantly, although see that Stellwagen Bank is disregarded from the model. We check easily that the additive model has a much lower RMSE (0.1638) than the original model (rmse(species.lm.cox.fix))

Thus, let’s check whether the difference between the models significant

anova(species.lm.cox.fix.add, species.lm.cox.fix)
## Analysis of Variance Table
## 
## Model 1: (((Predator.length.SI^2.4) - 1)/2.4) ~ Predator.length.SI + Depth + 
##     Location
## Model 2: (((Predator.length.SI^2.4) - 1)/2.4) ~ Prey.length.SI
##   Res.Df  RSS Df Sum of Sq     F Pr(>F)    
## 1   1859   50                              
## 2   1864 2128 -5     -2078 15424 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At \(\alpha = 0.01\), we would reject the null hypothesis and conclude that there is a significant difference between these models, and conclude that the additive model is preferable.

Discussion

We analyzed data collected from several locations in the North Atlantic ocean comparing how well linear models worked for four different species that filled two distinctly different ecological niches. The following map illustrates the geographic sources of our data:

map(database = "world",
      interior = FALSE,
      fill = TRUE,
      resolution = 0,
      col="gray90",
      xlim=c(min(n.atlantic.species$Longitude) - 3, max(n.atlantic.species$Longitude) + 3),
      ylim=c(min(n.atlantic.species$Latitude) - 3, max(n.atlantic.species$Latitude) + 3))

dogfish <- n.atlantic.species[n.atlantic.species$Predator == "Squalus acanthias",]
cod <- n.atlantic.species[n.atlantic.species$Predator == "Gadus morhua",]
bluefish <- n.atlantic.species[n.atlantic.species$Predator == "Pomatomus saltatrix",]
tuna <- n.atlantic.species[n.atlantic.species$Predator == "Thunnus thynnus",]


points(unique(dogfish$Longitude),
         unique(dogfish$Latitude),
         pch=1,
         col="brown",
         cex=2.0)

points(unique(cod$Longitude),
         unique(cod$Latitude),
         pch=2,
         col="red",
         cex=1.0)

points(unique(bluefish$Longitude),
       unique(bluefish$Latitude),
       pch=5,
       col="blue",
       cex=1.0)

points(unique(tuna$Longitude),
       unique(tuna$Latitude),
       pch=0,
       col="green",
       cex=1.0)

legend("bottomright",
       legend = c("Spiny Dogfish", "Atlantic Cod", "Bluefish", "Bluefin Tuna"),
       pch = c(1,2,5,0),
       col = c("brown","red","blue","green"))

As demonstrated on the map, there was overlap between three of the species, yet the linear model clearly was a better fit for the Bluefish than the Spiny Dogfish or the Atlantic Cod. Therefore, location does not seem to be a good explanation for the differences.

What we observed is that the linear model does not work well for benthopalegic species, but does seem to work moderately well for pelagic species, and very well for the Bluefin Tuna in particular.

Why doesn’t it work for benthopalegic fish? Likely it is a trick of biology, or to be more specific, behavior. The benthopalegic species have a more sedentary lifestyle and although they are highly opportunistic, they are infrequent feeders compared to aggressive pelagic fish which not only need to constantly feed in order to maintain their high energy requirements, but are active hunters rather than opportunists.

But, why would Bluefin Tuna fit the model so much better than Bluefish? Bluefin Tuna have a very efficient circulatory system. It possesses one of the highest blood hemoglobin concentrations among fish, which allows it to efficiently deliver oxygen to its tissues; this is combined with an exceptionally thin blood-water barrier to ensure rapid oxygen uptake. To keep its core muscles warm, the Bluefin uses countercurrent exchange to prevent heat from being lost to the surrounding water13. While all members of the tuna family are warm-blooded, the ability to thermoregulate is more highly developed in Bluefin Tuna than in any other fish14. The Bluefish has no such adaptations, and is a far less efficient predator. Bluefish are only very distantly related to Tuna, and have evolved their pelagic survival strategies distinctly from the Bluefin Tuna.

The simple linear model fits species with certain behavioral and physiological profiles, but not others.

Dataset Commentary

Despite the fact that we were able to find a model that would fit comparatively better, it is still unsatisfactory since it is only valid for a subset of the data. When we started the project, our goal was to find a data set that was well-curated, has a large number of observations and variables, and a good range of quantitative predictors that we would use to fit our model. We settled down on one data set that seemed to address all the requirements above.

Through the data cleaning process, we realize how far from our expectations was our choice of data set:

Appendix


  1. https://www.ncbi.nlm.nih.gov/pubmed/11790957

  2. http://www.ecori.org/green-economy/2015/1/15/reuse-of-fish-waste-yields-organic-liquid-fertilizer

  3. http://www.fao.org/docrep/003/w3647e/W3647E07.htm

  4. http://www.dw.com/en/population-growth-challenges-poor-nations/a-17010329

  5. http://www.who.int/nutrition/topics/3_foodconsumption/en/index4.html

  6. Bone 2008, p. 42.

  7. Bone 2008, p. 42.

  8. Steneck, R. S. (14 May 2012). “Apex predators and trophic cascades in large marine ecosystems: Learning from serendipity”. Proceedings of the National Academy of Sciences. 109 (21): 7953–7954. doi:10.1073/pnas.1205591109

  9. Pitcher, TJ, Parrish JK (1993). Functions of shoaling behaviour in teleosts. Chapman and Hall. pp. 363–427.

  10. https://en.wikipedia.org/wiki/Bluefish

  11. Froese, Rainer and Pauly, Daniel, eds. (2006). “Pomatomus saltatrix” in FishBase. March 2006 version.

  12. https://en.wikipedia.org/wiki/Atlantic_bluefin_tuna

  13. Hill, Richard W.; Gordon A. Wyse; Margaret Anderson (2004). Animal Physiology. Sinauer Associates, Inc. ISBN 0-87893-315-8

  14. Greenberg, Paul (27 June 2010). “Tuna’s End”. The New York Times. p. 28